Transcriptome-guided annotation and functional classification of long non-coding RNAs in Arabidopsis thaliana

Long non-coding RNAs (lncRNAs) are a prominent class of eukaryotic regulatory genes. Despite the numerous available transcriptomic datasets, the annotation of plant lncRNAs remains based on dated annotations that have been historically carried over. We present a substantially improved annotation of Arabidopsis thaliana lncRNAs, generated by integrating 224 transcriptomes in multiple tissues, conditions, and developmental stages. We annotate 6764 lncRNA genes, including 3772 that are novel. We characterize their tissue expression patterns and find 1425 lncRNAs are co-expressed with coding genes, with enriched functional categories such as chloroplast organization, photosynthesis, RNA regulation, transcription, and root development. This improved transcription-guided annotation constitutes a valuable resource for studying lncRNAs and the biological processes they may regulate.

www.nature.com/scientificreports/ Several databases store and classify plant lncRNAs 3,38,39,41 . Among these, we wish to highlight the CANTATAdb v2.0 database, which contains 4080 lncRNA genes 41 . The annotations in CANTATAdb are based on ten A. thaliana transcriptomes and a robust annotation methodology, including identifying lncRNAs using the Coding Potential Calculator (CPC) 47 . Another important database is GreeNC 38,48 , which also uses a predictive annotation through CPC to identify lncRNAs in different species based on transcripts available in Phytozome 49 and ENSEMBL 50 , including 2752 genes in A. thaliana. In addition, it classifies lncRNAs that can function as miRNA precursors 38 . The most widely used lncRNA reference annotation is Araport11 40 . Araport11 has 3559 lncRNA genes (2444 lincRNAs and 1115 NATs) 40 . While coding gene annotations in Araport11 arise from the integrative annotation pipeline analysis of 113 RNA-seq experiments on different tissues from plants grown under various conditions, the lncRNAs annotated in Araport11 arise from various sources. In particular, it combines the annotations mentioned above of lincRNAs from 4,44 and the NAT annotations from 44 with lncRNAs well annotated in literature (e.g., FLINC and COOLAIR) 12,31 . Thus, the lncRNA annotation process in Araport11 was nowhere nearly as strict as their approach to annotating protein-coding genes.
Despite these multiple available sources of annotated plant lncRNAs, few of them have been experimentally characterized or assigned a possible function. A commonly used approach to assign a biological function to lncRNAs is the so-called "guilt-by-association" strategy 51,52 . This involves generating gene co-expression networks and their subsequent functional annotation to assign potential biological functions to lncRNA genes 51,52 . Co-expression networks represent the similarity between the expression patterns of different genes in a set of conditions, developmental stages, and tissues 53 . Genes co-regulated in a wide array of biological conditions are likely controlled by the same regulators or may participate in the same or related biological function or process 52,[54][55][56] . This idea underlies "guilt-by-association" approaches, as lncRNAs can be assumed to work concurrently with the genes it is expressed with, and it is thus preemptively assigned the functions of the genes within its co-expression group. For this approach to work, multiple transcriptomes of the same organism in different stages of development, tissues, and various types of stress are required 53,57,58 . The more transcriptomes used, the better the statistical significance of the co-expression relationship between genes becomes. Furthermore, the diversity of transcriptomes makes it possible to identify specific networks for a condition or tissue and general networks 54 . In plants, co-expression networks have been successfully used for the identification of functions in both coding genes [59][60][61][62][63] and, more recently, in lncRNAs 6,[64][65][66][67][68][69] .
To address the need for a better annotation of lncRNAs in A. thaliana, we leverage the numerous publicly available RNA-Seq datasets to carry out a comprehensive reannotation of lncRNAs in A. thaliana. We reanalyzed 220 publicly available RNA-Seq datasets, in addition to four seedling transcriptomes generated in-house. Furthermore, we integrate these better annotated and expanded lncRNAs within gene co-expression networks, which enable us to identify potential functions.

Methods
Publicly available transcriptomes used. We selected 220 publicly available transcriptomes using the following criteria: (1) a minimum of 0.5 gigabases (GB) per transcriptome, and (2) generated in a condition, tissue, or developmental stage of wild-type Col-0 A. thaliana. These included: embryo, seed, hypocotyl, cotyledon, root tip, shoot apical meristem (SAM), seedling, root, plant callus, petiole, leaf, carpel, flower pedicel, petal, pollen, sepal, stamen, flower, stem internode, stem node, septum, valve, whole adult plant and conditions such as cold, heat, salinity, drought, blue light, red light, limited phosphate, limited iron and presence of abscisic acid (ABA). All transcriptomes were downloaded as raw reads from Gene Expression Atlas (GEA) 70 and Gene Expression Omnibus (GEO) 71 . Each dataset is described in detail in Table 1. Additionally, we generated four transcriptomes from the aerial part and roots of A. thaliana 8 day post-germination seedlings (see details below), totaling 224 transcriptomes (Dataset S3).
In-house transcriptome generation. Seedlings were grown A. thaliana in Murashige and Skoog (MS) solid medium within growth chambers under conditions of long days (21 °C, 16/8 h photoperiod cycles), approximately for 8 days. The aerial part (shoot) and roots were collected separately, with two biological replicates for each organ (fully open cotyledons and 2 rosette leaves greater than 1 mm long). Total RNA was extracted using TRIzol (Invitrogen, 15,596,018), and according to company specifications, samples were DNase I treated using TURBO™ DNase (Invitrogen, AM2238). The quality and concentration of the samples were measured using the NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific Inc). The integrity of the RNA was verified using a 1.5% agarose gel, and the mRNA was enriched using the NEBNext Poly (A) mRNA Magnetic Isolation protocol (NEBNext, E7490S). The libraries were prepared using the NEBNext Ultra II Directional RNA library kits (NEBNext, E7760S) and NEBNext Multiplex oligos for Illumina (SET 1) (NEBNext, E7335). The libraries were sequenced using the Hi-Seq X from Illumina, using 2 × 150 nt (PE150). The depth and characteristics of these libraries are summarized in Table S1. All the experiments were performed in accordance with relevant guidelines and regulations.
Filtering, assembly, and quantification of transcripts across all transcriptomes. We assessed the quality of all transcriptomes using FastQC v0.11.2 72 and MultiQC v1.0 73 . Low-quality reads and adapters were removed using Trimmomatic v0.32 (HEADCROP:10-5 LEADING:5 SLIDINGWINDOW:4:15 MINLEN:30-60) 74 . All quality filter reads were aligned to the A. thaliana TAIR10 genome 75 , using STAR v2.7.2.b (-alignMatesGapMax 120,000) 76 . The resulting alignments were assembled using StringTie v1.3.4 (− f 0.3 − m 50 − a 10 − j 15 − c 2.5) 77 , using the Araport11 annotation as a reference 40  lncRNA identification. To identify the lncRNAs, we first generated the amino acid sequence for all transcripts using TransDecoder v5.3.0 79 . We then applied nine sequential filters based on previous studies 5,9 (see Fig. S1). We refer to this process as the Strict Method (SM). First, (1) we selected all autosomal transcripts ≥ 200 nt using the infoseq program of EMBOSS v6.6.0 80 . We eliminated sequences whose translated ORF or nucleotide sequence had homology to proteins in the Uniprot database 81 as measured by the (2) blastp (e-value ≤ 1e−6) or (3) blastx (e-value ≤ 1e−6, strand = "plus") program, respectively 82 . We subsequently removed sequences with (4) identifiable protein domains found in the base of Pfam (v33.0) 83 using the HMMER v3.1b2 program 84 (e-value ≤ 1e−6) or (5) with identifiable signal peptides using signalP v4.1 85 (D-cutoff: 0.45). For any reminder sequences, (6) we removed those that had an ORF > 100 aa using the program getorf of EMBOSS v6.6.0 80 . We did an additional filtering step of all sequences with homology to non-redundant proteins (nr) annotated in the NCBI database 85,86 using BLASTx 82 (evalue ≤ 1e−6, strand = "plus"). For each remaining transcript, we identified the best blast hit against the 'nr' database with a percentage of identity above 70% (pident ≥ 70.000). For each best hit, we used the blastdbcmd function 82 to obtain the information related to the ID. The transcripts annotated in NCBI as: "hypothetical protein" (in Refseq), "similar to" (NCBI's annotation pipeline), "putative protein", "unknown (unknown protein, unknown, partial, unknown)", "predicted protein" and "unnamed protein product" 87 were retained. tRNAs and rRNAs were identified using infernal v1.1.2 88 and the covariance models in the Rfam database 89 . We additionally compared sequences with tRNAs and rRNAs reported in A. thaliana using BLASTn 82 (evalue ≤ 1e−6, strand = "plus"). All sequences identified as tRNAs or rRNAs were discarded. Finally, we eliminated transcripts with introns > 6000 bp. After filtering, we manually reviewed transcripts classified in Araport11 40 as coding proteins or genes and in our annotation as lncRNAs. This manual review consisted of verifying if these genes had annotation as functional proteins or annotated domains; in these cases, the lncRNA was discarded; if it was a hypothetical or not described protein, the lncRNA was retained. Thus, all sequences that passed this final review constituted the final set of SM lncRNAs.
Classification of lncRNAs by genomic position. LncRNAs are generally classified by their positional relationship to other genes. We used the following non-overlapping categories, based on the GENCODE annotation 1 : (1) Intergenic lncRNAs (lincRNAs) lncRNAs found in intergenic regions.  www.nature.com/scientificreports/ We further classified lncRNAs by their expression level, considering all lncRNAs with an expression level of fewer than 3 transcripts per million (TPM) in one transcriptome as Low Confidence (SM LC). The remaining lncRNAs were classified as High Confidence (SM HC).
It is worth mentioning that all the isoforms of the overlapping gene are considered for all these categories. To know with which genes our lncRNAs overlap, we used the annotation of Araport11 40  Coding potential assessment. CPAT (3.04) 92 was used to estimate the sequence-based coding potential of all transcripts as an additional validation method. CPAT is a framework designed for the alignment-free analysis of coding potential in a transcript context, using statistical analysis of relative kmer-frequencies as its basis. Transcripts with known annotation were used to create a Hexamer frequency model and a Logistic regression model using the make_hexamer_tab.py and make_logitModel.py scripts of the CPAT software framework. The relative reliability of the Regression model was estimated by analysis of the associated ROC curve, yielding an area under curve (ROC) score of 0.968.
Transcript coding potentials were predicted for each sequence using this model and results grouped in one of four categories: coding (coding potential > 0.5), non-coding (coding potential ≤ 0.5), high confidence coding (coding potential ≥ 0.9) and high confidence non-coding (coding potential ≤ 0.1).

in R.
Quantification of lncRNAs by tissue and stage. The transcriptomes were divided into tissue and developmental stage categories based on their age and tissue of origin. Notably, some categories are not bona fide tissues (e.g. whole plant, seedlings). However, these were considered their own category as these transcriptomes can be readily differentiated from others. All the transcriptomes were classified into five developmental stages based on the classification by 94 (Fig. 2b). The first two stages belong to the vegetative phase and include: seed germination (Stage 1, 3 to 5 days old) and leaf development (Stage 2, 6 to 25 days old); the rest of the stages are part of the reproductive phase, ranging from the presence of the first inflorescence (at 26 days old) (Stage 3, 26 to 29 days old), flower production (Stage 4, 30 to 47 days old), to the generation of siliques (Stage 5, 48 to 51 days old) ( Table 1, Fig. 2b).
To identify lncRNAs specific to a tissue or stage of development, we calculated the value of the tissue specificity index Tau 95 . The calculated Tau values range from 0 to 1 where genes that are tissue or stage-specific have values close to 1 (Fig. S2, Dataset S4). Only genes with Tau values higher than the median Tau value of mRNAs (0.54) were considered tissue-specific or developmental stage-specific (Dataset S4).

Generation of coding and non-coding gene co-expression networks.
To determine the possible functions of lncRNAs, we used a guilty-by-association approach. This approach identifies enriched functional annotations of protein-coding genes co-expressed with the lncRNAs, which allows inferring the biological processes in which these lncRNAs may be involved. The co-expression network was built using the WGCNA (1.69) package 96 based on the table of raw counts for the full transcriptome normalized using the variance stabilizing transformation (VST), part of the DESeq2 (1.28.1) package 97 . The adjacency function was weighted by the power of correlation between the different genes, and the law of free-scale networks determined the parameter β. To ensure that the average connectivity of the network was continuous, we chose a value of β = 12, which is the lowest value for which the unscaled topology index curve remains stationary (Fig. S3). From this point on, we will refer to the groups of co-expressed genes as co-expression modules or simply modules, following the nomenclature used by the WGCNA program 64 . The network was of type signed with a bicor correlation (biweight midcorrelation) and the option of separate modules (unmerged) with a minimum module size of 50 genes. The expression profiles were represented by their main component (module eigengene). An eigengene is the first right-singular vector of the standardized gene expression 98 that serves as a summarized representation of the expression of all genes in each module. To identify the functions associated with each co-expressed module, we performed an enrichment analysis of Gene Ontology (GO.db_3.11.4) categories using topGO (2.40.0) 99 and the genome-wide annotation of Arabidopsis (org.At.tair.db) as background for the Biological Process (BP) ontology. Finally, we used a Fisher test correcting for multiple testing (Benjamini-Hochberg) (qval.bh < 0.01, FDR < 1%) to assess the significance of the enrichment of GO categories. ReviGO (rrvgo v 1.6.0) was used to summarize and remove redundant GO terms and visualized using treemap v2.4-6 R library.
Genome browser. All lncRNA annotations were uploaded to the UCSC Genome Browser as a track for visualization 91 . The coordinates of all lncRNAs genes and their classification are available in Dataset S1.

Results
Using the SM, we identify 6764 lncRNA genes (7070 transcripts). These included 4354 lincRNAs (4548 transcripts), 2060 NATs (2133 transcripts), 213 sense-exonic (248 transcripts), and 185 intronic (187 transcripts) (Fig. 1a, Dataset S1), 78 intronic lncRNAs had no transcriptional orientation (sense) as they were identified in single-end transcriptomes only. Furthermore, 33 lncRNA genes (46 transcripts) were categorized as both NATs and sense-exonic due to the position of the lncRNA flanked by both sense and antisense coding genes in the DNA strand. These were manually verified to ensure they were not extended 3′ UTRs of overlapping proteincoding genes. Additionally, 15 genes had isoforms belonging to different categories (Dataset S2). To provide a measure of the observed expression for lowly expressed lncRNAs, we classified those that had less than 3 TPMs in a single transcriptome as Low Confidence (SM LC) and the remaining lncRNAs as High Confidence (SM HC) (Dataset S2). The single transcriptome threshold was used as there are numerous tissues (carpel, flower pedicel, petal, petiole, pollen, sepal, septum, stamen, stem internode, stem node, and valve) for which we only have a single transcriptome (Dataset S1). Additionally, we assessed the coding potential of the lncRNAs identified by the SM using CPAT 100 . We found they had significantly lower coding potential scores than coding genes in Araport11 (Fig. S4a) and that the large majority of them were classified as either non-coding or high confidence non-coding by CPAT (Fig. S4b). As expected, the identified lncRNAs have fewer exons per transcript (median 1; average 1.23) (Fig. S5a) than coding genes (median 4; average 6). Furthermore, their mature transcripts are smaller (average 437.3 nt) than that of their coding counterparts (average 1799 nt) (Fig. S5b). These characteristics coincide with what has been previously observed in animals 5,7,8,101 , flies 102 and other plants 68,[103][104][105][106] .
The total of lncRNAs annotated by the SM (6764 genes) outnumbers the most prominent databases in A. thaliana: GreenNC (v1.12) has 2752 genes (3008 transcripts) 38 , CANTATAdb (v2.0) 4080 genes (4373 transcripts) 41 and Araport11, 3559 genes (3970 transcripts) 40 . A comparison with these databases revealed that 3772 lncRNAs genes in our annotation are novel and have not previously been reported in any of these databases (Fig. 1b); the new lncRNAs were categorized into 2326 lincRNAs (2454 transcripts), 1218 NATs (1227 transcripts), 111 sense-exonic (124 transcripts) and 145 intronic (146 transcripts). These new lncRNAs represent a 93.08% (2275 over 2444) increase in the number of lincRNAs and a 134.70% (1502 over 1115) increase in NATs, with respect to the Araport11 database. Additionally, we find that 398 lncRNA genes of lncRNAs are shared between our annotation and the GreeNC database 38 , 1485 with CANTATAdb 41 , and 2637 with Araport11 40 , being the Araport11 database the one with the best agreement with our data; our annotation contains approximately 74.09% (2637 over 3559) of the lncRNAs annotated in Araport11 (Fig. 1b).
Surprisingly, only 130 lncRNAs are shared between GreeNC, CANTATAdb, and Araport11 databases, and there are only 42 lncRNAs shared among the four annotations (Fig. 1b). It is important to note that there are likely other lncRNAs in A. thaliana that are not identified in our analysis, since not all conditions, tissues, and developmental stages have been surveyed using RNA-Seq. However, our annotation is the first to take advantage of most of the transcriptomic data available for this species, ensuring that the sequences obtained are only those of expressed lncRNAs. This, combined with a robust annotation method, avoids redundancy with other types of transcripts that are not lncRNAs.
It is worth noting that within our annotation, 48 lncRNA genes (120 transcripts) have an ambiguous annotation, as they are simultaneously annotated as NAT and sense-exonic (33 lncRNA genes; 93 transcripts), lincRNAs and sense-exonic (5 genes; 12 transcripts), lincRNAs and NAT (9 genes; 13 transcripts), and intronic and sense-exonic (1 gene; 2 transcripts) (Dataset S2). Specifically, in the case of lncRNAs annotated as NAT sense-exonic, they overlap two different protein-coding genes, thereby acquiring a separate annotation for each gene. Similarly, other lncRNA genes had isoforms in different categories, depending on the genomic location of each isoform.
Expression patterns of lncRNAs. In addition to annotating lncRNAs, we leveraged the transcriptomic information to explore how many lncRNAs were expressed amongst A. thaliana tissues, developmental stages, and conditions (Fig. 2). We found more lncRNAs expressed in flower, root, seedling, and silique (Fig. 2a). www.nature.com/scientificreports/ Organs with higher cell-type diversity, such as flowers, silique, roots, seedlings and leaves had a higher number of lncRNAs (Fig. 2a). This tendency has been previously observed in animals, where organs with more diversity  www.nature.com/scientificreports/ of cell types, such as the brain, express more lncRNAs [112][113][114] . Reproductive tissues are also known to host a greater diversity of lncRNAs. Similarly, in our data, flowers have more lncRNAs than other organs (Fig. 2a). Interestingly, the number of lncRNAs expressed in the flower is much higher than in its individual parts (stamen, sepal, petal, carpel, and pedicel), further suggesting the high tissue and cell-type diversity of this organ may be due to the multiple tissues that make up this organ. An enrichment of lncRNAs in reproductive tissues has been previously reported in multiple plant species such as soy, corn, and rice 34,115,116 and animal testis 9,57,117,118 . Another category that stands out for its number of expressed lncRNAs is seedlings (Fig. 2a), composed of a mixture of tissues in a particular developmental stage. As most of the transcriptomes from abiotic stress conditions used in this study were from seedlings, many lncRNAs expressed in response to these stresses are expressed in and thus assigned to seedlings (Fig. 2a). Also, the number of transcriptomes and the sequencing depth in each category correlates positively with the number of lncRNAs found (Fig. 2a).
In terms of development, the germination phase (stages 1) has the highest number of lncRNAs (Fig. 2c), followed by stage 4 (flower development) (Fig. 2), and does not appear to correlate with the number of transcriptomes in each developmental stage (Fig. 2c). Developmental stages where tissue differentiation or organ formation occur tend to express multiple lncRNAs in both plants 6,119-121 and animals 8,46,57,113 . Unfortunately, the early stages of tissue differentiation are not represented in our data set, which could help us identify lncRNAs that participate in tissue formation.
Tissue and stage-specific lncRNA expression. Genes specifically expressed in a particular tissue or stage of development may be important for establishing the identity of that tissue or stage. We found that lncRNAs in A. thaliana, as in most organisms, are expressed in a more tissue-specific manner compared to coding genes (Fig. S2b). The embryo and the whole adult plant had the highest amount of unique lncRNAs, while the root and the embryo expressed more unique coding genes (Fig. 2a). Interestingly, despite not being the most abundant in lncRNAs (Fig. 2a), the embryo has the highest number of unique lncRNAs. Also, the root expressed most of the unique coding and lncRNAs genes (Fig. 2a). Interestingly, there were many more unique coding genes in the root and almost no unique lncRNAs expressed (Fig. 2a). We did not observe an increase in unique genes in tissues with various stress conditions. Also, most unique lncRNAs are expressed in the reproductive phase of the plant rather than in the vegetative phase (Fig. 2c). Dividing the lncRNAs by biotype, we find that 67.1% LncRNAs with known tissue-specific functions. Some lncRNAs with known functions display a high tissue-specificity measured by Tau that agrees with their reported functional tissue (Table S2). Among these, we find lncRNAs IPS1 and At4, which have functions related to phosphate starvation 13,37 , and the lncRNA MARS, which is involved in changes of the chromatin conformation in response to ABA 22 . As expected, these three lncRNAs have high tissue-specificity values in root tissues (Dataset S4). In addition, the lncRNA FLINC, related to the regulation of flowering 12 , is specifically enriched in the SAM. On the other hand, the tissue-specific expression of some known functional lncRNAs does correspond to the tissue where they are reported to function. Such is the case of HID1, a lncRNA involved in hypocotyl elongation 11 , which has high tissue-specificity in the SAM (Table S2), despite being previously found to be ubiquitously expressed 11 . Similarly, APOLO, which participates in lateral root development in response to auxin 16,120 , has high tissue-specificity in the petiole (Table S2). This discrepancy is likely due to the lack of auxin-treated roots in our dataset, which is where we expected to see the highest APOLO Tau values.
Co-expression of lncRNAs with coding genes. To infer a possible function for all annotated lncRNAs, we used a so-called guilty-by-association approach. To this aim, we constructed a co-expression network including all coding and non-coding genes using WGCNA 96 . A total of 224 transcriptomes with 34,937 genes were analyzed to construct this network.
We obtained a total of 45 co-expression modules (Fig. 3). 1425 (21%) lncRNA genes were found in 44 of the 45 co-expression modules. Overall, 516 lincRNAs, 746 NAT, 104 sense-exonic, and 59 intronic were co-expressed. Module 1 harbored the most lncRNAs, with 383 of them, primarily NATs (290), followed by lincRNAs (79), 13 sense, and one intronic lncRNA (Fig. 3). According to the functional enrichment for biological processes, this module stood out for processes related to photosynthesis, the organization of chloroplasts, and response to light. The next modules with the highest number of lncRNAs are modules 4 and 3; these modules are related to the processing and transcription of RNA. In total, 91% (40/44) of modules that housed lncRNAs presented functional enrichment for biological processes. Interestingly, 746 novel lncRNAs were co-expressed with coding genes and distributed amongst 40 modules. Modules 1, 3, 4, and 6 had the most newly annotated lncRNAs, most of them NAT lncRNAs (Fig. S7). It is worth noting that most novel lncRNAs (3026, 80.2%) in our annotation were not co-expressed with coding genes.
We found that the largest number of coexpressed lncRNAs are in the functional category enriched for chloroplast organization and photosynthesis, with positive expression eigengene values in organs related to photosynthesis such as leaves, cotyledons, and hypocotyls. These lncRNAs are divided into four modules (Fig. S8) related to more specific functions such as response to radiation (response to red light, high-intensity light) (module 1), chloroplast and plastid organization (modules 1, 32, and 33), response to cold (modules 1 and 36) and seed, embryo development (modules 32, 33 and 36). The following functional category where we find numerous lncRNAs is related to RNA regulation and its transcription. This category comprises four modules (Fig. S9) with functions such as mRNA metabolic process (module 4), RNA processing (module 3), and regulation of gene expression (modules 16,17). Genes in these functional categories are most highly expressed in embryos, SAM, and plant callus. The expression profile in this functional category is very similar to the function category of cell division (5 modules) (Fig. S12), which has positive expression values in embryos, seeds, SAM, and roots. The group of modules with fewer lncRNAs is enriched in genes that participate in the response to abiotic conditions (Fig. S16). However, many modules (modules 1, 2, 5, 6, 14, 19, 23, 36, and 45) are enriched in genes that participate in other stress responses (such as drought and cold). Still, they were classified in other functional groups, such as root development (Fig. S10).
For example, in Module 7, we identify 15 lncRNAs highly expressed in root, root tip, plant callus, and seedlings (which include root tissues) and appear to be upregulated in response to limited phosphate conditions (Fig. 4). Amongst the genes in this module we find ERF71, a transcriptional activator involved in root development 122 ; FRO2, which is involved in root growth in response to lack of iron 123 ; NRT21, a repressor of lateral root initiation in response to low nitrate or high sucrose conditions 124 ; MYB93, a transcription factor that acts as a negative regulator of lateral root formation 125 , Aux/IAA proteins, which function as repressors of early auxin response genes 126 ; MiZ1, which plays a role in lateral root development by maintaining auxin levels and negatively regulates sensitivity to cytokinins 127 among several others (Dataset S5). Indeed, this module is highly enriched in transmembrane transport and root system development processes.
Several functionally characterized lncRNAs belong to specific functional categories. For example, the DRIR, At4, and APOLO lncRNAs are found in the group of modules related to root function and stress response. It is known that DRIR regulates the closure of stomata in drought 20 , At4 is associated with the response to phosphate deficiency 128 , and APOLO is a regulatory lncRNA that directly controls its neighboring gene PID and a many of independent genes by DNA association in response to auxin 16,129 . The functions of these lncRNAs fit what we observed in the functional enrichment of the modules where they are found. In addition to these examples, we have some others in the group of chloroplasts and photosynthesis, such as FLORE 24 . This lncRNA has been identified as an important factor in the photoperiod. The lncRNA FLINC, identified as a mediator of flowering in response to temperature 12 , is found in the group of RNA regulation and transcription functions (Dataset S5).

Discussion
Here, we generated a new and improved annotation of lncRNAs in A. thaliana, supported by 224 transcriptome datasets (Dataset S3) obtained from 24 organs (parts of the plant), 11 conditions, and 5 developmental stages (20 timepoints) (Dataset S3). We found 6764 lncRNAs genes (7070 transcripts), including 3772 novel lncRNAs (Fig. 1b). Among our annotated lncRNAs, we identified 58 genes (86 transcripts) of lncRNAs experimentally validated in A. thaliana from the EvlncRNAs database 130 , which supports our ability to identify functionally relevant lncRNAs by leveraging existing publicly available transcriptomes. Given our much cohort of transcriptomic evidence, we find few lncRNAs shared with databases such as GreeNC (398 lncRNAs genes shared) 38 and CANTATAdb (1485 lncRNAs genes shared) 41 , and about 74.09% of the lncRNAs in Araport11 are found in our annotation 40 Importantly, our curation approach helped us identify several lncRNAs that were erroneously annotated as coding genes including the lncRNA IPS1, an experimentally validated lncRNA with multiple target sites for miR399, induced in the absence of phosphate 13 . Another example is its paralog At4, which presents functional redundancy with IPS1 37 . Although these two lncRNAs are functionally conserved in tomato (Lycopersicon esculentum L.) (lncRNA TPSI1) 107 , Medicago truncatula (lncRNA Mt4) 108 , rice (lncRNA OsIPS1) 109 and barley (HvIPS1) 110 . Similarly, the well-characterized lncRNA APOLO, which regulates lateral root development 16 , is annotated as a protein-coding gene. The experimentally characterized lncRNAs HID1 11 , MARS 22 , and DRIR 20 were also erroneously classified ( Table 2). Given the wide usage of the Araport11 database, we recommend a revision of their annotations based on our results.
The hundreds of transcriptomics datasets we used allowed us to analyze the abundance of lncRNAs in the different tissues and the development stages. Our analysis revealed that organs with more cell-type diversity display the highest number of lncRNAs in A. thaliana (Fig. 2a). This pattern is particularly prominent in organs related to reproduction (flower & silique), similarly to previous reports in multiple animals 9,57,117,118 and plant species 34,115,116 .
We find that the depth and the number of the transcriptomes are the experimental factors that most affect our capacity to identify novel lncRNAs in any given sample, similarly to previous annotation efforts in various species 116,131,132 . Thus, we recommend having higher sequencing depth to expedite the discovery of lncRNAs. One limitation of our study is the lack of data from stages where tissue differentiation occurs in the plant, including the flower formation and embryonic stages and the formation of the gametes-surveying these biological conditions would be essential to help complete the catalog of A. thaliana lncRNAs and further our understanding of the role of lncRNAs in the formation of plant structures.
In animals, organ formation and differentiation primarily occur in the embryonic stage, while in plants, it occurs not only in the embryonic phase but also in germination and flower development. It has been shown that widely expressed and conserved lncRNAs are expressed during tissue development, which have the highest probability of being functional. As the tissue matures, an increasing number of species and organ-specific lncRNAs are more likely to be non-functional 46 .
We find that the expression of lncRNAs is significantly more specific than the expression of coding genes. Nearly 62% (4188) of lncRNAs have expression profiles restricted to a specific tissue or stage, while only 45.6% (12,638) of proteins are specific to a particular tissue or stage. This finding agrees with previous reports in A. thaliana 4 and other species 1,9,133,134 . Moreover, most sense-exonic, NATs and lincRNAs displayed high tissue specificity, while intronic lncRNAs had the lowest tissue specificity, overall very similar to the tissue specificity of protein coding genes.
We found 1241 co-expressed lncRNAs, which we could associate with our broad functional categories (Figs. 3, S8-S16). Using this approach, we find functional categories involving lncRNAs similar to those previously reported in both A. thaliana and other plant species. For example, we find 70 A. thaliana lncRNAs distributed in modules 5, 21, 18, and 39, all functionally enriched in coding genes associated with drought. Numerous lncRNAs are involved in this response in plants 135 , including 664 lncRNAs in maize 136 , 51 in cassava, 1096 and 126 in a drought-resistant variety of Brassica napus 137 . Similarly, we identified five modules with 72 lncRNAs related to response to pathogens (Fig. S14). lncRNAs have been previously found to be differentially expressed in response to infection in tomato 138 and maize 139 . However, the functions that we can assign to lncRNAs are limited by our set of transcriptomes; we can only identify enriched biological functions in the tissues and conditions available in our panel. This analysis could be improved by including more transcriptomes in the future.
Previous works have already established the relationship between lncRNAs and processes related to photosynthesis in A. thaliana and rice 140 , as well as in the response to different types of light 18,141 . Photosynthesis is arguably the most important biological pathway in plants. Our results show that the function with the highest number of lncRNAs is related to chloroplast organization and response to light (Fig. S8); this indicates that a large number of lncRNAs may be involved in these processes. It is worth noting that most of our data were obtained from photosynthetic tissues and seedlings, which may explain why our largest modules, comprising the majority of lncRNAs, are associated with photosynthetic processes.
We also identified lncRNAs co-expressed with genes involved in root development and root response to multiple environmental stimuli (Fig. 4). lncRNAs have previously been shown to participate in root differentiation and response to different stress conditions in A. thaliana 16,129,[142][143][144] , Medicago truncatula, where 5561 lncRNAs change their expression in the root due to osmotic stress 145 , and in Populus, where 295 lncRNAs change their expression during root development 119 .
Surprisingly, several previously characterized lncRNAs, including ELENA1, MARS, COOLAIR, IPS1, and HID1, are not associated with any particular module. This might be partly because some of these lncRNAs perform their regulatory function in specific environmental conditions (e.g., prolonged cold in the case of COOLAIR), poorly represented in our transcriptomic panel 31 . Furthermore, the functional association via co-expression is not a fail-proof method; it only identifies lncRNAs expressed in most tissues sampled and www.nature.com/scientificreports/ that have a strong expression association with genes with similar functions. Thus, many other novel lncRNAs reported in our annotation with no functional association may have important functions that this approach could not identify. We hope this highly curated, transcriptomic informed lncRNA annotation with functional associations via co-expression in A. thaliana becomes a valuable resource to the A. thaliana and the plant lncRNA community. In the future, we want to assess if the functional association relationships between lncRNAs and other RNAs are conserved in different species and how their loss or gain might be associated with the loss or gain of particular traits in this and other plant families.

Data availability
Raw datasets, software, and documents are available under a CC-BY license at Github 146 and FigShare (see Supplementary Information) and NCBI (PRJNA765039).